npj Systems Biology and Applications — Latest Matching Preprints

1

In Silico Trial Simulation with Artificial Intelligence-Generated Synthetic Control Cohorts Reproduces Results of a Randomized Controlled Trial in Acute Myeloid Leukemia

Kumar Reddy, K.; Hahn, W.; Winter, S.; Roellig, C.; Mueller-Tidow, C.; Serve, H.; Baldus, C. D.; Fransecky, L.; Schliemann, C.; Burchert, A.; Schaefer-Eckart, K.; Kaufmann, M.; Schetelig, J.; Bornhaeuser, M.; Middeke, J. M.; Eckardt, J.-N.

2026-07-16 health informatics 10.64898/2026.07.15.26358123 medRxiv

Top 2%

0.6%

Rising costs, slow accrual and molecular substratification of cancers necessitate novel clinical trial designs. We demonstrate that artificial intelligence-generated synthetic patients can replace real controls to reproduce results of the SORAML trial. Using external multimodal data from 1,377 acute myeloid leukemia (AML) patients from previous trials and a real-world registry, we fine-tuned a tabular foundation model to generate synthetic patients, reproducing clinical and genetic features and outcome associations. Synthetic patients were then matched to the original SORAML intervention group using Cox risk scores, replacing the original control and reproducing the original trial result with near-identical median event-free survival (EFS) and treatment effect (original hazard ratio [HR] 0.64, 95%-confidence interval [CI] 0.47-0.87, p=0.004; with synthetic control HR 0.66, 95%-CI 0.48-0.90, p=0.009). Our findings demonstrate that AI-generated synthetic patients can serve as statistically rigorous controls supporting novel trial designs.

2

Rationale and guidance for implementing the continual reassessment method for dose-finding in controlled human infection model studies

Weerasinghe, C.; Osowicki, J.; Simpson, J. A.; Crocker-Buque, T.; McCarthy, J.; Williams, E.; Price, D. J.

2026-07-17 infectious diseases 10.64898/2026.07.16.26358128 medRxiv

Top 2%

0.6%

Controlled human infection models (CHIMs) are increasingly used in infectious disease research to study pathogen dynamics and evaluate interventions under controlled conditions. However, these studies are resource-intensive and involve ethical and safety constraints, making efficient study design critical. Dose-finding is a key early component in CHIMs, where the aim is to identify a challenge dose that achieves a target infection probability. Traditional rule-based designs are commonly used but can be inefficient, motivating the use of model-based adaptive approaches such as the Bayesian Continual Reassessment Method (CRM). Although CRM has been extensively studied and widely adopted in Phase I oncology trials for identifying the maximum tolerated dose of therapeutics, its application in CHIM settings remains limited, particularly when the endpoint of interest is infection. This tutorial provides step-by-step guidance for implementing a Bayesian CRM in dose-finding CHIMs, using an oropharyngeal Neisseria gonorrhoeae challenge as a motivating case study. The framework outlines key design components, including dose-grid specification, dose-response model, prior elicitation, Bayesian updating, decision rules, and stopping criteria, with particular emphasis on a clinically interpretable parameterisation. Trial operating characteristics are evaluated through simulation studies under multiple dose-response scenarios and prior-predictive analyses, and compared with a commonly used '3+3' type rule-based design. This work highlights the advantages of Bayesian model-based designs for dose-finding in CHIMs over classic rule-based designs and provides a structured, reproducible framework for implementing CRM, supporting their application in future CHIM studies.
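The Bayesian updating and decision-rule steps listed above can be sketched with a generic one-parameter "power model" CRM. This is a minimal illustration, not the tutorial's own clinically interpretable parameterisation: the skeleton values, the normal prior, the grid approximation, and the function names are all assumptions made for the example.

```python
import math

def crm_posterior_means(skeleton, doses_given, outcomes,
                        prior_sd=1.16, grid_n=801, span=5.0):
    """Posterior mean infection probability per dose level.

    Assumed model (hypothetical, for illustration): P(infection at dose i)
    = skeleton[i] ** exp(a), with a ~ Normal(0, prior_sd**2). The posterior
    over `a` is approximated on an evenly spaced grid.
    """
    step = 2.0 * span / (grid_n - 1)
    grid = [-span + k * step for k in range(grid_n)]
    weights = []
    for a in grid:
        log_w = -0.5 * (a / prior_sd) ** 2          # log prior, up to a constant
        for d, infected in zip(doses_given, outcomes):
            log_p = math.exp(a) * math.log(skeleton[d])
            # Bernoulli log-likelihood of the observed outcome at dose d
            log_w += log_p if infected else math.log1p(-math.exp(log_p))
        weights.append(math.exp(log_w))
    total = sum(weights)
    return [
        sum(w * skeleton[i] ** math.exp(a) for a, w in zip(grid, weights)) / total
        for i in range(len(skeleton))
    ]

def recommend_dose(posterior_means, target):
    """Decision rule: dose whose estimated infection probability is closest to target."""
    return min(range(len(posterior_means)),
               key=lambda i: abs(posterior_means[i] - target))

# Hypothetical five-dose grid and a target infection probability of 0.70
skeleton = [0.05, 0.15, 0.30, 0.50, 0.70]
prior_means = crm_posterior_means(skeleton, [], [])
updated_means = crm_posterior_means(skeleton, [0, 0, 0], [1, 1, 1])
next_dose = recommend_dose(updated_means, target=0.70)
```

After each cohort, the observed outcomes are appended and the recommendation is recomputed; stopping criteria and safety constraints would sit on top of this loop.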

3

Analytical perturbation reveals hidden instability of biological phenotypes

Piorkowska, N. J.; Ostromecki, A.; Franik, G.; Bizon, A.

2026-07-16 endocrinology 10.64898/2026.07.13.26357916 medRxiv

Top 4%

0.4%

Background: Unsupervised machine learning has become a cornerstone of computational phenotyping across clinical medicine, genomics, imaging, and multi-omics research. However, phenotype discovery relies on a sequence of analytical decisions - including missing-data handling, preprocessing, dimensionality reduction, clustering methodology, and stochastic initialization - that are rarely evaluated collectively. Although clustering stability has been extensively investigated, the robustness of complete analytical workflows remains largely unexplored. Results: We developed an Analytical Perturbation Framework that systematically quantifies the robustness of phenotype discovery by perturbing complete unsupervised learning workflows rather than individual clustering algorithms. Using a real-world cohort of 1,286 women with polycystic ovary syndrome (PCOS), we generated 116 valid analytical pipelines comprising alternative preprocessing strategies, missing-data handling methods, dimensionality reduction approaches, clustering algorithms, and random initializations. Agreement between independently generated phenotype solutions was consistently low (median Adjusted Rand Index = 0.079), indicating substantial sensitivity of phenotype discovery to routine analytical decisions. Variance decomposition identified preprocessing as the largest contributor to phenotype instability (22.8%), followed by clustering methodology (14.6%), whereas stochastic initialization explained only 3.1% of the observed variability. At the patient level, most individuals exhibited reproducible phenotype assignments (median Patient Robustness Score = 0.719), although a substantial subgroup showed markedly lower assignment stability.
Feature perturbation analyses identified follicle-stimulating hormone, anti-thyroglobulin antibodies, anti-thyroid peroxidase antibodies, total testosterone, luteinizing hormone, and androstenedione as the strongest contributors to computational robustness, rather than biological importance. Finally, phenotype solutions demonstrating greater computational robustness also exhibited greater biological coherence during independent validation.
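The agreement measure reported above, the Adjusted Rand Index (ARI), can be computed from the contingency table of two label vectors. The sketch below is a plain-Python version of the standard ARI formula plus a median over all pipeline pairs; the function names and the toy label vectors are illustrative assumptions, not the authors' code.

```python
from collections import Counter
from itertools import combinations
from math import comb
from statistics import median

def adjusted_rand_index(labels_a, labels_b):
    """Chance-corrected agreement between two clusterings of the same patients."""
    n = len(labels_a)
    if n != len(labels_b) or n < 2:
        raise ValueError("need two label vectors of equal length >= 2")
    # Pairs of patients grouped together in both, in A only, and in B only
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:      # degenerate case, e.g. both put everyone in one cluster
        return 1.0
    return (sum_cells - expected) / (max_index - expected)

def median_pairwise_ari(solutions):
    """Median ARI over all pairs of phenotype solutions (one label vector per pipeline)."""
    return median(adjusted_rand_index(a, b) for a, b in combinations(solutions, 2))

# Toy example: three hypothetical pipelines clustering four patients
pipelines = [[0, 0, 1, 1], [1, 1, 0, 0], [0, 1, 0, 1]]
agreement = median_pairwise_ari(pipelines)
```

ARI is invariant to relabeling of clusters, so the first two toy solutions agree perfectly even though their label values differ.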

4

Aligning Reinforcement Learning with Clinical Practice for Safe Decision Support in Pediatric Sepsis

Bueso, F. G.; Wardle, R.; Manescu, P.; Spear, J.; Ray, S.; Peters, M.

2026-07-21 intensive care and critical care medicine 10.64898/2026.07.20.26358476 medRxiv

Top 4%

0.3%

Offline reinforcement learning (RL) has emerged as a promising framework for clinical decision support in sepsis, yet most existing studies focus exclusively on adult populations, leaving pediatric care largely unexplored despite important physiological and treatment differences. In this work, we develop offline RL policies for pediatric sepsis management in the Pediatric Intensive Care Unit (PICU) using a retrospective cohort of 2,229 episodes from Great Ormond Street Hospital (GOSH), formalized as finite horizon Markov Decision Process (MDP) with joint intravenous fluid and vasopressor actions. To better capture pediatric organ dysfunction dynamics, we incorporate Phoenix 8, a recently proposed pediatric sepsis severity score, as an intermediate reward shaping signal in addition to terminal 90 day mortality. We systematically vary the time step size (4, 8, and 12 hours) and reward structure (terminal 90 day mortality, with and without Phoenix 8 based intermediate shaping), and compare Double Deep Q Networks (DDQN), Conservative Q Learning (CQL), and a behavior cloning (BC) model of clinician practice. CQL consistently exhibits stable learning dynamics and favorable Fitted Q Evaluation estimates, while DDQN is prone to overestimation and instability, particularly at finer temporal resolutions and with dense rewards. CQL policies achieve high action-level agreement with historical clinician decisions for both fluids and vasopressors and reproduce clinically plausible escalation patterns across sepsis severity strata, whereas DDQN policies diverge more frequently toward implausible dosing. Temporal aggregation emerges as a key regularizer: moving from 4 hour to 8 hour bins shortens horizons, smooths reward noise, and improves stability without erasing clinically meaningful dynamics, with 8 hour binning providing the best trade off between policy performance and granularity. 
Our findings highlight time step size as a core design choice in offline RL for healthcare and provide empirical evidence that alternatives beyond the conventional 4 hour setup can enhance stability and safety while preserving clinical interpretability.

5

Patient-Specific EEG Baseline Establishment Using the E-norms Method for Pediatric Seizure Detection Without Labeled Training Data

Jabre, J. F.

2026-07-16 neurology 10.64898/2026.07.13.26357876 medRxiv

Top 5%

0.3%

The aim of this work is to validate patient-specific EEG baseline establishment using the e-norms method as a screening and retrospective-review tool for seizure detection in pediatric epilepsy. The method was applied to 247 seizure-free EEG recordings (263.92 hours) from 10 patients in the CHB-MIT Scalp EEG Database (ages 3-18). A composite stability metric combining first-derivative dynamics, spectral entropy, variance, and line length was computed per 2-second epoch across 23 channels. Patient-specific detection thresholds were derived from each patient's seizure-free baseline using a weighted statistical procedure. Performance was validated against 72 expert-annotated seizures (2,705 epochs) across 62 seizure files, with durations spanning 6 to 264 seconds (44-fold range). The results show that detection achieved 94.4% event-level sensitivity (68 of 72 seizures; 95% CI 86.6-97.8%) and 81.5% epoch-level sensitivity (2,204 of 2,705 epochs; 95% CI 80.0-82.9%). Eight of ten patients achieved 100% event-level sensitivity with epoch-level sensitivity ranging from 58.7% to 100.0%. Two patients showed partial event-level failures (CHB-15: 17 of 20; CHB-18: 5 of 6), with the four missed events attributable to two characterizable failure modes. Patient-specific thresholds ranged from 4.06 to 4.81 (mean 4.51 +/- 0.25); threshold variation did not correlate reliably with age or sex, confirming that no universal threshold could achieve comparable performance. Detection margins ranged from 0.88 to 1.24 times. Patient-specific e-norms achieves 94.4% event-level sensitivity for pediatric EEG seizure detection without requiring labeled seizure training data, exceeding published human expert inter-rater agreement (50-76%) and recent automated approaches in adult cohorts using behind-the-ear EEG and wearable ECG. Two characterizable failure modes account for the four missed events and inform appropriate clinical use. 
As a high-sensitivity screening tool complementary to real-time alarm systems, the method is ready for adult validation, prospective deployment, and head-to-head benchmarking.
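The abstract does not say how its 95% confidence intervals were computed. As a worked check, the sketch below applies the Wilson score interval for a binomial proportion to the reported counts; the choice of Wilson is an assumption, although it yields intervals that match the reported ones to one decimal place.

```python
import math

def wilson_interval(successes, n, z=1.959964):
    """Wilson score interval for a binomial proportion (z = 1.959964 gives 95%)."""
    p = successes / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2.0 * n)) / denom
    half_width = z * math.sqrt(p * (1.0 - p) / n + z * z / (4.0 * n * n)) / denom
    return centre - half_width, centre + half_width

# Counts taken from the abstract
event_lo, event_hi = wilson_interval(68, 72)        # event-level sensitivity
epoch_lo, epoch_hi = wilson_interval(2204, 2705)    # epoch-level sensitivity
print(f"event-level: {100 * event_lo:.1f}-{100 * event_hi:.1f}%")
print(f"epoch-level: {100 * epoch_lo:.1f}-{100 * epoch_hi:.1f}%")
```

Note that the epoch-level interval treats epochs as independent, which is a simplification because epochs within the same seizure are correlated.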

6

A ReAct Agentic AI System for Natural Language Querying and Statistical Analysis of The Cancer Genome Atlas Clinical Data

Korutla, R.; Amal, S.

2026-07-17 health informatics 10.64898/2026.07.15.26358188 medRxiv

Top 5%

0.3%

The Cancer Genome Atlas (TCGA) holds clinical data for over 11,000 patients across 33 cancer types, but access is hard because of complex file structures, heterogeneous formats, and the need for programming. We present an agentic system for natural language querying and statistical analysis of TCGA clinical data. The system uses a large language model as an autonomous ReAct agent that selects from eight computational tools, including data extraction, descriptive statistics, Kaplan-Meier survival analysis with log-rank tests, hypothesis testing, and verification against the curated TCGA Pan-Cancer Clinical Data Resource (CDR). The agent reasons about intermediate results, adapts its approach, and returns clinically contextualized responses with source attribution and auditable traces. We introduce TCGA-Agent-Bench, 440 queries across five difficulty tiers with ground truth from the independently curated TCGA-CDR, evaluated with dual metrics of numerical accuracy and clinical completeness. The system achieves 93.4% overall accuracy (100% single-patient lookups, 99.1% cohort statistics, 92.8% comparative analyses), outperforming a fixed rule-based pipeline (87.1%), a single-pass LLM (81.8%), and retrieval-augmented generation (66.9% on a subset). Most of the benchmark is answerable from the CDR alone, so we locate the extraction layer's value in fields the CDR lacks (drug treatments, TNM components, biomarkers, biospecimen metadata): on 26 queries targeting these, the full system answers 100% versus 3.8% for CDR-only. Ablations show the reasoning loop is most impactful (+9.1% accuracy, +22.0 completeness points). A tool-based agentic architecture enables accurate, auditable analysis of clinical repositories, with value driven by tool design and recovered fields rather than model scale.

7

Hypertension Phenotypes in a National Database: A Three-Axis State Model Integrating Diagnosis, Treatment Intensity, and Blood Pressure Control (The NDB-K7Ps-Study-8)

Nakajima, K.; Sekine, A.

2026-07-19 cardiovascular medicine 10.64898/2026.07.16.26358276 medRxiv

Top 6%

0.2%

Hypertension is commonly defined as a binary condition despite substantial heterogeneity in diagnosis, treatment, and blood pressure (BP) control. We propose a three-axis state model integrating diagnosis status, treatment intensity, and BP control to better characterize hypertension phenotypes. The framework generates 27 possible states that can be condensed into seven clinically meaningful groups. We applied the model to 5,129,584 Japanese adults using the National Database of Health Insurance Claims and Specific Health Checkups. Hierarchical cluster analysis, sensitivity analysis excluding patients with cardiovascular diseases other than hypertension, and validation against antihypertensive medication use were performed. Overall, 64% of participants were classified as normotensive, whereas 36% belonged to hypertension-related groups, including 11% with unrecognized hypertension and 7% with diagnosed but untreated hypertension. Agreement with data-driven hierarchical cluster analysis was substantial (weighted κ=0.87). The group distribution remained largely unchanged in the sensitivity analysis, supporting the robustness of the proposed classification. Hypertension diagnosis also showed high validity, with a sensitivity of 96.5%, specificity of 91.8%, and substantial agreement with antihypertensive medication use (κ=0.78). This three-axis framework provides a robust and clinically interpretable approach for characterizing hypertension phenotypes, enabling systematic identification of care gaps and supporting research, clinical decision-making, and population health management.

8

Statistical Inference and Power Analysis for Comparative F1 and Fβ Scores under Correlated Classifier Pairs

Hsu, C.-Y.; Liu, Q.; Shyr, Y.

2026-07-17 dermatology 10.64898/2026.07.15.26358166 medRxiv

Top 6%

0.2%

As machine learning and artificial intelligence systems are increasingly used in healthcare, rigorous evaluation of their classification performance has become critical. The F1 and Fβ scores are widely adopted metrics for assessing performance in imbalanced biomedical data. Recently, we introduced psF1, a unified statistical framework for inference and study design for single and comparative F1 and Fβ scores under the assumption of independent classifiers. In practice, however, benchmarking two classifiers on the same dataset creates a correlated paired setting. Ignoring this intrinsic dependency leads to overestimation of the standard error and a substantial loss of statistical power. To address this, we develop psF1pair, an advanced framework for statistical inference and power analysis that explicitly accounts for correlations between classifier pairs. Extensive simulation studies demonstrate the performance of psF1pair, and its utility is further illustrated through application to a real-world imaging classification system. As expected, higher correlation between classifiers yields narrower confidence intervals and enhanced statistical power. A freely available R package is provided to facilitate implementation, supporting accurate evaluation and study design for predictive and classification models in biomedical research.
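For orientation, the standard definition of the Fβ score and the general variance identity behind the paired argument are written out below. These are textbook formulas, not the psF1pair estimators themselves; the symbols P (precision), R (recall), TP, FP, FN, and the classifier labels A and B are notation introduced here.

```latex
F_\beta \;=\; \frac{(1+\beta^2)\,P\,R}{\beta^2 P + R}
        \;=\; \frac{(1+\beta^2)\,\mathrm{TP}}{(1+\beta^2)\,\mathrm{TP} + \beta^2\,\mathrm{FN} + \mathrm{FP}},
\qquad
\operatorname{Var}\!\bigl(\hat F_{\beta}^{A} - \hat F_{\beta}^{B}\bigr)
  \;=\; \operatorname{Var}\!\bigl(\hat F_{\beta}^{A}\bigr)
  + \operatorname{Var}\!\bigl(\hat F_{\beta}^{B}\bigr)
  - 2\,\operatorname{Cov}\!\bigl(\hat F_{\beta}^{A}, \hat F_{\beta}^{B}\bigr)
```

When both classifiers are scored on the same cases, the covariance term is typically positive, so dropping it (the independence assumption) inflates the variance of the difference. This is the mechanism by which higher correlation narrows confidence intervals and raises power.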

9

Brain Network Excitability Predicts Clinical Severity in Multiple Sclerosis

Amato, L. G.; Angiolelli, M.; Demuru, M.; Troisi Lopez, E.; Quarantelli, M.; Granata, C.; Depannemaecker, D.; Jirsa, V.; Bonavita, S.; Mazzoni, A.; Sorrentino, P.

2026-07-16 neurology 10.64898/2026.07.10.26357763 medRxiv

Top 6%

0.2%

Comprehensive biomarkers of multiple sclerosis (MS) capable of simultaneously diagnosing the condition, capturing symptom severity and predicting treatment efficacy remain elusive. Although several studies have highlighted the pivotal role played by demyelinating lesions in determining MS structural pathology, their relationship with symptom severity is limited. Here, we combined personalized computational brain modeling with magnetoencephalography (MEG) recordings from 17 MS patients and 20 healthy controls (CTR) to derive personalized brain network excitability parameters, which we tested as MS biomarkers. Personalized parameters discriminated between CTR and MS participants with high accuracy, also classifying between progressing and remitting MS patients. Notably, they also predicted MS clinical scales across multiple domains. In all clinical tasks, personalized parameters consistently outperformed standard clinical measures and total lesion loads. Together, these results highlight the potential of personalized brain modelling in deriving integrative MS biomarkers, capable of simultaneously identifying the condition, classifying MS subtypes and predicting symptom severity.

10

Muscle proteins in plasma associate to distinguished phenotypes in amyotrophic lateral sclerosis

Azizi, L.; Aksoylu, I.; Bueno Alvez, M.; Foucher, J.; Juto, A.; Seitz, C.; Press, R.; Samuelsson, K.; Kläppe, U.; Uhlen, M.; Edfors, F.; Bergström, S.; Fang, F.; Nilsson, P.; Öijerstedt, L.; Manberg, A.; Ingre, C.

2026-07-16 neurology 10.64898/2026.07.14.26357727 medRxiv

Top 6%

0.2%

Background: Amyotrophic lateral sclerosis (ALS) is a neurodegenerative disease characterized by death of upper and lower motor neurons, usually presented with clinical heterogeneity. Fluid biomarker development remains dominated by neurofilament light chain (NEFL), a marker of neuroaxonal injury. NEFL is however unspecific to ALS and its phenotypes and there is currently a lack of biomarkers that capture ALS heterogeneity such as onset site and ALS-frontotemporal spectrum disorder (ALS-FTSD). Therefore, we investigated whether plasma proteomics could reveal pathway-level signatures that stratify and explain ALS heterogeneity. Methods: We profiled ~5,400 plasma proteins (Olink Explore HT) in 299 patients with ALS and 50 age- and sex comparable healthy controls. We used two complementary analytic frameworks: (i) differential protein abundance analysis to identify altered proteins in ALS and across clinical subgroups, and (ii) weighted gene correlation network analysis (WGCNA) to identify coordinated protein modules and relate them to ALS diagnosis and to ALS-specific clinical traits (site of onset, ALS-FTSD, ALS functional rating scale-revised (ALSFRS-R) score, and plasma NEFL). Results: Differential abundance analysis identified 56 proteins altered in ALS versus controls, of which 40 were increased. WGCNA identified 11 co-expression modules, with ALS samples having the strongest correlation to a protein module (n=51) highly enriched for muscle-related proteins. Out of the 40 proteins that had increased expression levels, 29 overlapped with the muscle-enriched protein module, indicating that muscle related proteins are the dominant circulating proteomic signature in ALS. This signal extended to clinical stratification: spinal-onset patients showed a strong positive association with the muscle-module. 
Further, differential abundance analysis of spinal- versus bulbar-onset ALS identified changes that mapped predominantly to the same module, supporting a molecular signature of onset phenotype. In contrast, cognitive status (ALS-FTSD) mapped to distinct modules enriched for extracellular matrix/cell-adhesion pathways, consistent with a separable biological axis of disease heterogeneity. Although multiple modules correlated with NEFL, trait-specific signatures were not fully explained by neuroaxonal injury. Notably, the muscle-enriched module increased with higher NEFL and lower ALSFRS-R, supporting its interpretation as a severity-linked, muscle-involvement proxy. Conclusions: Large-scale plasma proteomics reveals that heterogeneity in ALS reflects underlying biological structures. We identified a dominant muscle-associated protein network that distinguished ALS patients from controls and correlated with disease onset phenotype and severity, alongside distinct protein networks linked to ALS-FTSD. By integrating differential protein abundance with network-based analysis, we defined pathway-level biomarker signatures that extend beyond NEFL, enabling biologically informed patient stratification and improved therapeutic monitoring.

11

Multi-model forecasting of respiratory disease activity in Germany during the 2024-2025 season

Bracher, J.; Wolffram, D.; Amaral Lind, R.; Bardeck, N.; Boehm, M.; Contreras, S.; Doenges, P.; Guenther, F.; Kaiser, R.; van de Kassteele, J.; Kuhlmann, A.; Lange, B.; Nemcova, B.; Priesemann, V.; Reinacher, U.; Rodiah, I.; Sandmann, F.; the RESPINOW Study Group; Schienle, M.

2026-07-21 epidemiology 10.64898/2026.07.20.26358471 medRxiv

Top 7%

0.2%

Respiratory diseases cause considerable morbidity in autumn and winter and are a priority in public health monitoring. In Germany, they are subject to a number of surveillance systems, including both pathogen-specific and syndromic indicators. In this paper we present a collaborative multi-target and multi-model real-time forecasting system rolled out during the 2024/25 season, and discuss differences to earlier efforts carried out during the COVID-19 pandemic. A total of nine models were run to generate forecasts of general practitioner consultations for acute respiratory infections (ARI), hospitalizations for severe acute respiratory infections (SARI) and confirmed cases of seasonal influenza and RSV. As all indicators were subject to retrospective revisions, forecasting models were combined with a nowcasting step. Whenever multiple models were available for the same indicator, we combined them into an ensemble. Nowcasts showed convincing performance, even though for some models Christmas break effects led to an upward bias in early January. Forecasts were overall well-calibrated and most models outperformed simple benchmark models. These improvements were generally more substantial for age-stratified than pooled targets, and concentrated at lead times of two to three weeks. Anticipating the peak timing and magnitude proved to be challenging, with many models predicting too flat curves with a too early turnaround (e.g. already in late January rather than mid-February for SARI). The combined ensemble forecast was among the best-performing approaches, but unlike in previous related projects did not consistently outperform individual models. We conclude by discussing learnings on the organization of collaborative forecasting projects in post-COVID-19 times and the potential of AI-supported modelling.

12

Efficient stochastic epidemic simulation via the Sellke construction

van Boven, M.; Bootsma, M. C.

2026-07-17 epidemiology 10.64898/2026.07.16.26358219 medRxiv

Top 7%

0.2%

Stochastic epidemic models are a cornerstone of infectious disease epidemiology and are often used to study intervention scenarios. However, large run-to-run variability can make intervention effects difficult to estimate precisely. We revisit the epidemic Sellke construction, which assigns each individual an infection threshold for the cumulative infection hazard such that, conditional on the thresholds, the epidemic trajectory becomes deterministic. This enables coupling of simulations with and without an intervention, yielding low-variance effect estimates even when outcomes such as final size or peak incidence vary widely between runs. We develop an exact, event-driven implementation that maintains infection and recovery events in priority queues. Cumulative infection-hazard updates require O(log N) time per event, yielding overall complexity O(E log N) for E events in a population of size N. The implementation achieves computational performance comparable to the classical Gillespie algorithm while naturally accommodating non-Markovian infectious periods and complex infectiousness profiles. We illustrate the approach using distance-dependent spread of avian influenza between poultry farms in the Netherlands and a multilayer population with households, schools, and workplaces. In both examples, coupling enables efficient within-run comparisons of intervention scenarios across stochastic realisations.
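The coupling idea can be illustrated with a minimal Sellke-style SIR simulation. This sketch assumes a homogeneously mixing population, which is far simpler than the spatial and multilayer settings in the abstract, and it is not the authors' implementation; the function names, parameter values, and the choice of exponential infectious periods are assumptions. Thresholds and infectious periods are drawn once and reused for both scenarios, so any difference in outcome is due to the change in transmission rate alone.

```python
import heapq
import math
import random

def draw_sellke_inputs(n, seed, mean_infectious_period=1.0):
    """Draw per-individual Exp(1) thresholds and infectious periods once, for reuse."""
    rng = random.Random(seed)
    thresholds = [rng.expovariate(1.0) for _ in range(n)]
    periods = [rng.expovariate(1.0 / mean_infectious_period) for _ in range(n)]
    return thresholds, periods

def sellke_sir(thresholds, periods, beta, n_initial=1):
    """Deterministic epidemic given the thresholds; returns {individual: infection time}."""
    if beta <= 0:
        raise ValueError("beta must be positive")
    n = len(thresholds)
    # Susceptibles are infected in order of increasing threshold
    susceptibles = sorted(range(n_initial, n), key=lambda i: thresholds[i])
    recoveries = [(periods[i], i) for i in range(n_initial)]   # min-heap of recovery times
    heapq.heapify(recoveries)
    infection_times = {i: 0.0 for i in range(n_initial)}
    t, pressure, next_s = 0.0, 0.0, 0
    while recoveries:
        rate = beta * len(recoveries) / n      # slope of cumulative infection pressure
        t_recovery = recoveries[0][0]
        t_infection = math.inf
        if next_s < len(susceptibles):
            i = susceptibles[next_s]
            t_infection = t + max(0.0, thresholds[i] - pressure) / rate
        if t_infection < t_recovery:
            # Pressure reaches the next threshold before anyone recovers
            t, pressure = t_infection, thresholds[i]
            infection_times[i] = t
            heapq.heappush(recoveries, (t + periods[i], i))
            next_s += 1
        else:
            pressure += rate * (t_recovery - t)
            t = t_recovery
            heapq.heappop(recoveries)
    return infection_times

# Coupled comparison: identical random inputs, intervention halves the transmission rate
thresholds, periods = draw_sellke_inputs(n=500, seed=1)
baseline = sellke_sir(thresholds, periods, beta=2.0)
intervention = sellke_sir(thresholds, periods, beta=1.0)
cases_averted = len(baseline) - len(intervention)
```

Because both runs share the same thresholds and periods, everyone infected under the intervention is also infected at baseline in this homogeneous setting, so the per-realisation effect is never negative. Independent runs would not have this property.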

13

Association between stage-specific sleep bout durations and obstructive sleep apnea severity: A variable-domain functional regression approach

Rahman, M. M.; Guha Niyogi, P.

2026-07-16 epidemiology 10.64898/2026.07.14.26358060 medRxiv

Top 7%

0.1%

The apnea-hypopnea index (AHI), the conventional metric of obstructive sleep apnea (OSA) severity, is typically studied using scalar summaries of sleep architecture, such as the total time spent in each sleep stage. Although clinically interpretable, these summaries fail to capture the temporal organization of overnight sleep-stage sequences and may obscure stage-specific associations with OSA severity. Modeling the complete sleep-stage trajectory provides substantially richer temporal information; however, because total sleep duration varies across individuals, sleep-stage trajectories are observed over subject-specific domains, limiting the applicability of conventional functional regression methods that assume a common observation interval. We therefore applied Variable-Domain Functional Regression (VDFR) to overnight polysomnographic data from the APPLES study (n = 1,103), treating the epoch-by-epoch sleep-stage sequence as a continuous, variable-length functional predictor of AHI. We compared three levels of sleep-stage granularity: five stages (Wakefulness, N1, N2, N3, REM), three stages (Wakefulness, Non-REM, REM), and binary staging (Wakefulness vs. Sleep). Functional sleep-stage terms were significant across all staging granularities and model structures (all p-values ≤ 0.001). Wake, N1, and N2 were positively associated with AHI, whereas N3 and REM were negatively associated, with REM exhibiting the strongest association. These effects were attenuated under coarser staging representations, highlighting the importance of preserving fine-grained sleep architecture. To our knowledge, this is the first application of VDFR to overnight polysomnographic data in OSA, showing that accommodating subject-specific sleep durations enables the identification of stage-specific temporal associations with AHI severity that are attenuated or obscured by coarser staging and conventional scalar analyses.

14

Multi-Agent Dynamic Refinement Outperforms Static RAG in Clinical Reasoning for Complex Nephrology Cases

Yano, Y.; Kakizaki, H.; Nagasu, H.; Kishi, S.; Koshida, T.; Nihei, Y.; Hirano, A.; Sugawara, Y.; Imaizumi, T.; Osakabe, Y.; Sakaguchi, Y.; Nangaku, M.; Mori, H.; Naito, T.; Ohashi, M.; Maruyama, S.; Matsui, I.; Isaka, Y.; Okada, H.; Suzuki, Y.; Kashihara, N.

2026-07-16 nephrology 10.64898/2026.07.15.26358121 medRxiv

Top 8%

0.1%

Background: Large language models (LLMs) struggle with dynamic, longitudinal clinical reasoning. We developed a Multi-Stage Iterative Clinical Reasoning Agent framework to address this gap and systematically decouple the clinical efficacy of static retrieval-augmented generation (RAG) from dynamic self-refinement. Methods: Ten complex longitudinal nephrology cases, rigorously selected via a modified Delphi consensus technique, were blindly evaluated by four board-certified nephrologists and a multi-model AI panel. We compared three architectures across nine cognitive steps: (Model A) a baseline frontier LLM, (Model B) an LLM augmented with static guideline-based RAG, and (Model C) our proposed multi-agent framework featuring RAG integrated with iterative self-critique and refinement. Results: In human evaluations (20-point scale), Model C (mean 17.2, SD 1.2) significantly outperformed both Model A (16.1, 1.3) and Model B (16.2, 1.2) (P < 0.001). Implementing static RAG (Model B) yielded no significant improvement over the baseline. Automated AI evaluations (15-point scale) corroborated these findings: Model C (14.7, 0.6) outscored Model A (14.2, 0.9, P < 0.001) and Model B (14.3, 0.9, P = 0.01). While monolithic models exhibited severe score degradations in planning-heavy tasks such as dynamic differential diagnoses, the multi-agent framework effectively intercepted error cascades, achieving significantly higher diagnostic accuracy (mean 17.6, P = 0.019) and therapeutic management scores (17.3, P = 0.002). Conclusions: Static knowledge retrieval alone fails to enhance frontier LLM performance in longitudinal medical reasoning. Distributing clinical workflows into a multi-agent dynamic refinement pipeline significantly improves reasoning completeness, intercepts error cascades, and safely resolves planning bottlenecks in complex patient care.

15

Large Language Model - Enhanced Decision Tree Framework for Identifying Multiple Sclerosis Diagnoses from Clinical Documentation

Venkatesh, S.; DelSignore, M.; Wu, X.; Morris, M.; Kerr, W. T.; Visweswaran, S.; Wang, Y.; Xia, Z.

2026-07-17 neurology 10.64898/2026.07.14.26357416 medRxiv

Top 10%

0.1%

Background. Early diagnosis and intervention are crucial in multiple sclerosis (MS), yet diagnostic delays are common. Large language models (LLMs) such as generative pre-trained transformers (GPTs) may help streamline diagnostic workflows by extracting MS diagnostic signals from clinical notes. Objective. To derive MS diagnosis status from the first neurology note using a computable algorithm based on the 2017 McDonald criteria and applying GPT-4 for node-level reasoning within a structured decision framework. Methods. We analyzed first neurology notes from 125 randomly selected patients (including those with MS, related disorders, and controls) enrolled in a clinic cohort between 2017 and 2023. We included the clinical history and diagnostic testing sections but redacted the assessment and plan. We converted the 2017 McDonald criteria into a decision tree and provided expert-curated clinical knowledge to guide GPT-4 reasoning at each decision node. GPT-4 generated binary decisions at each node to traverse the tree and classified MS diagnoses at terminal nodes. We evaluated performance against neurologist-assessed diagnoses and characterized hallucinations (non-factual, incongruent, irrelevant, over-reliant, and logical reasoning errors). Results. In this study cohort (mean age 40 ± 13 years; 81% women) representative of the clinic population, GPT-4 performed well in predicting MS diagnosis (84% accuracy, 79% precision, 74% recall, 91% specificity) using first neurology notes. Hallucinations occurred in 32 cases (26%), most commonly incoherence (75%) and overreliance (47%). Conclusion. A structured, LLM-guided decision framework can flag MS diagnoses from early clinical documentation. Large-scale studies are needed to mitigate hallucinations, validate this approach, and test implementation in clinical settings.

16

FoodScribe: an open-source semantic framework for nutrient estimation from free-text dietary records

Gouda, H.; Sala Climent, M.; Agongo, J.; Gaikwad, S. P.; Nattakom, A.; Zhao, H. N.; Xing, S.; Boland, B. S.; Holt, T.; Guma, M.; Dorrestein, P. C.

2026-07-17 nutrition 10.64898/2026.07.15.26358181 medRxiv

Top 11%

0.1%

Show abstract

Efficiently summarizing dietary records at scale remains a persistent bottleneck in nutritional epidemiology. We present FoodScribe, which translates free-text meal descriptions into quantitative nutrient profiles by combining ingredient parsing with nutrient retrieval from the USDA FoodData Central (FDC) database. Benchmarked with three LLM providers on the Nutribench dataset, FoodScribe annotated 3,807 meal descriptions in 2.5 hours, a task otherwise requiring substantial manual effort from trained nutritionists. FoodScribe achieved F1 scores of 0.79-0.89 across macronutrient estimation, with models performing better for protein than fat estimation. Application to a Mediterranean diet intervention cohort indicated dietary shifts consistent with the intervention pattern based on model-derived estimates. Integration with metabolomics data suggested that fiber and vegetable intake were positively associated with a fecal metabolite cluster.

17

Diversity and Utilization Patterns of Medicinal Plants Used in the Management of Diabetes Mellitus: An Ethnobotanical Study in Selected Communities in Sierra Leone

Kamara, S.; Jimmy, A. I.; Gary, L. P.

2026-07-21 pharmacology and therapeutics 10.64898/2026.07.18.26358386 medRxiv

Top 11%

0.1%

Show abstract

Background: Diabetes mellitus is an increasing public health challenge in Sierra Leone, where access to diagnosis, treatment, and long-term care remains limited. Traditional medicine continues to play a significant role in disease management; however, ethnobotanical knowledge related to diabetes remains insufficiently documented. Methods: A cross-sectional ethnobotanical survey was conducted among 40 informants, including traditional healers, herbalists, and knowledgeable community members in Waterloo, Pendembu, and Bo. Data were collected using structured questionnaires administered via Kobo Toolbox and paper-based tools. Information on medicinal plants, plant parts used, preparation methods, routes of administration, and knowledge transmission pathways was obtained. Quantitative ethnobotanical indices, including Frequency of Citation (FC), Relative Frequency of Citation (RFC), and Informant Consensus Factor (ICF), were calculated. Results: A total of 21 medicinal plant species were documented. The most frequently cited species were Moringa oleifera (FC = 9; RFC = 0.225), Vernonia amygdalina (FC = 7; RFC = 0.175), and both Cassia sieberiana and Telfairia occidentalis (FC = 6; RFC = 0.150). Leaves were the most commonly utilized plant part (40.9%), and decoction was the predominant preparation method (76.2%), with oral administration accounting for 95.2% of use. The Informant Consensus Factor (ICF = 0.69) indicated a relatively high level of agreement among informants. Knowledge was primarily transmitted through apprenticeship and inherited family practices. Conclusion: Traditional medicinal plants remain an important component of diabetes management in Sierra Leone. The high level of consensus among informants and the repeated citation of specific plant species suggest structured and culturally validated therapeutic practices. The findings provide a foundation for future phytochemical and pharmacological investigations and highlight the need for documentation, preservation, and sustainable utilization of ethnobotanical knowledge.
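
The reported RFC values are consistent with the usual definition RFC = FC / N, where N is the number of informants (40 in this survey). The short Python sketch below reproduces that arithmetic from the figures quoted in the abstract; the function and variable names are illustrative, and the abstract itself does not spell out the formula, so the definition used here is an assumption that happens to match the reported numbers.

```python
def relative_frequency_of_citation(fc: int, n_informants: int) -> float:
    """RFC = FC / N: the share of informants who cited a given species."""
    if n_informants <= 0:
        raise ValueError("n_informants must be positive")
    if not 0 <= fc <= n_informants:
        raise ValueError("fc must lie between 0 and n_informants")
    return fc / n_informants


N_INFORMANTS = 40  # informants surveyed, as stated in the abstract

# Frequency of Citation (FC) values quoted in the abstract.
citations = {
    "Moringa oleifera": 9,
    "Vernonia amygdalina": 7,
    "Telfairia occidentalis": 6,
}

# Rounded to three decimals, matching the precision used in the abstract.
rfc = {
    species: round(relative_frequency_of_citation(fc, N_INFORMANTS), 3)
    for species, fc in citations.items()
}
```

With these inputs, `rfc` holds 0.225, 0.175, and 0.15 for the three species, in agreement with the values reported above.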

18

Learned ultrasound segmentation and deformable CT fusion for augmented reality endovascular surgery

Dillon, T. M.; Quevedo Moreno, D.; Rutherford, E. K.; Ayers, B.; Salomon, B.; Kubi, B.; Thomas, J.; Roche, E.

2026-07-17 cardiovascular medicine 10.64898/2026.07.15.26358084 medRxiv

Top 11%

0.1%

Show abstract

Minimally invasive endovascular procedures offer reduced surgical trauma, shorter recovery times, and improved outcomes, but rely on 2D fluoroscopic X-ray imaging, which provides limited depth perception and exposes patients and clinicians to ionizing radiation. Here we present an augmented reality (AR) system that fuses intravascular ultrasound (IVUS) and electromagnetic (EM) position tracking with preoperative computed tomography (CT) to produce an anatomically accurate, deformation-corrected navigational reference. A robotic device performs ECG-gated pullback of the IVUS probe, capturing 4D aortic motion across the cardiac cycle. We introduce a deep learning architecture for extracting vascular lumen boundaries and side-branch orifices from artifact-prone IVUS streams, and a semantically driven non-rigid CT-IVUS fusion pipeline robust to false positive landmarks. We evaluate the platform with trained surgeons in benchtop phantom studies and in-vivo ovine models, and demonstrate its application to fenestrated endovascular aneurysm repair (FEVAR). Compared to fluoroscopy alone, AR guidance significantly reduces cannulation time, radiation exposure, and cognitive workload, while improving procedural efficiency and safety. Our IVUS-EM and CT aortic datasets are released open source.

19

Nocturnal cough as a syndromic surveillance signal for respiratory illness in England

Irons, T.; Carlsson, E.; Tang, M. L.; Mellor, J.; Rubin, C.; Allen, A.; Elliot, A. J.; Kageback, M.; Packham, J.

2026-07-21 epidemiology 10.64898/2026.07.20.26357937 medRxiv

Top 12%

0.1%

Show abstract

We evaluated aggregated, privacy-preserving smartphone-detected nocturnal cough activity from the Sleep Cycle application as a potential syndromic surveillance signal in England. Weekly cough metrics from January 2023 to January 2026 were compared with UK Health Security Agency indicators: NHS 111 acute respiratory infection (ARI) triage calls, influenza and COVID-19 PCR positivity, and hospital admission rates for influenza, COVID-19, and respiratory syncytial virus. We evaluated total cough counts alongside two population-normalised metrics, coughs per user and coughs per hour of sleep, and assessed temporal relationships nationally and regionally using cross-correlation with prewhitening. The strongest and most consistent associations were observed for NHS 111 ARI triage calls, where population-normalised cough metrics showed raw national correlations of approximately 0.95 and retained prewhitened correlations above 0.55 at lag 0. This indicates that nocturnal cough activity closely tracks short-term variation in an established syndromic surveillance indicator, beyond shared seasonality, long-term trends, and autocorrelation. Similar near-contemporaneous patterns were observed across regions. Population-normalised cough metrics also showed epidemiologically plausible leading associations with pathogen-specific indicators: coughs per hour of sleep peaked one week before influenza PCR positivity, while both coughs per user and coughs per hour of sleep peaked one week before COVID-19 PCR positivity. Hospital-based indicators showed weaker and more heterogeneous relationships, but the normalised cough metrics still showed plausible temporal alignment with influenza and COVID-19 admissions, including contemporaneous associations with influenza admissions and short leading associations with COVID-19 admissions. In contrast, unnormalised total cough counts produced less stable and often non-interpretable lag structures, consistent with sensitivity to variation in observation volume. These findings suggest that passive, near-real-time nocturnal cough monitoring can provide a population-level signal of respiratory symptom burden, with greatest utility as a broad syndromic indicator that complements surveillance sources affected by healthcare-seeking behaviour, laboratory turnaround times, backfilling, and reporting delays.
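
Cross-correlation with prewhitening, as mentioned in this abstract, can be sketched in Python as follows. The abstract does not state which filter was used, so this sketch assumes the simplest common choice: fit an AR(1) model to the input series, apply the same filter to both series, and correlate the residuals at each lag. All data here are synthetic, and the series names, coefficients, and lag range are assumptions made for illustration.

```python
import numpy as np


def fit_ar1(x: np.ndarray) -> float:
    """Least-squares estimate of the AR(1) coefficient of a centred series."""
    xc = x - x.mean()
    return float(np.dot(xc[1:], xc[:-1]) / np.dot(xc[:-1], xc[:-1]))


def prewhiten(x: np.ndarray, phi: float) -> np.ndarray:
    """Apply the AR(1) filter x[t] - phi * x[t-1] to a centred series."""
    xc = x - x.mean()
    return xc[1:] - phi * xc[:-1]


def cross_correlation(x: np.ndarray, y: np.ndarray, max_lag: int) -> dict:
    """Pearson correlation of x[t] with y[t + lag]; lag > 0 means x leads y."""
    n = len(x)
    out = {}
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = x[: n - lag], y[lag:]
        else:
            a, b = x[-lag:], y[: n + lag]
        out[lag] = float(np.corrcoef(a, b)[0, 1])
    return out


# Synthetic weekly series (about three years) in which x leads y by one week.
rng = np.random.default_rng(0)
n_weeks = 156
innovations = rng.normal(size=n_weeks)
x = np.zeros(n_weeks)
for t in range(1, n_weeks):
    x[t] = 0.8 * x[t - 1] + innovations[t]  # autocorrelated "cough" signal
y = np.roll(x, 1)
y[0] = 0.0
y = y + 0.3 * rng.normal(size=n_weeks)  # lagged, noisy "indicator" signal

phi = fit_ar1(x)
ccf = cross_correlation(prewhiten(x, phi), prewhiten(y, phi), max_lag=4)
best_lag = max(ccf, key=ccf.get)
```

Filtering both series with the model fitted to the input removes the autocorrelation that would otherwise inflate correlations across many lags, so the surviving peak points at the lag of the short-term relationship. In this synthetic setup that peak is at lag 1 by construction.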

20

Mathematical Modeling of Rift Valley Fever in the Sahelian Zone

Djimramadji, H.; Ndonane, B.; Djaouga, P.; MARKHOUS, H. M.; Djoumountanan, E.; TOBAYE, K.; Abakar, F. M.

2026-07-17 epidemiology 10.64898/2026.07.15.26358164 medRxiv

Top 12%

0.1%

Show abstract

We develop a mathematical model of Rift Valley Fever integrating mosquito vectors, ruminants, and humans, based on an SEIR-type structure with vertical transmission in vectors. Local data from the Sudanian and especially the Sahelian zones are used to capture the impact of climatic variations on mosquito population dynamics. The mathematical analysis establishes the model's positivity, determines the basic reproduction number R0, and demonstrates the local and global stability of the disease-free equilibrium. Sensitivity analysis (PRCC) highlights the most influential parameters, while the stochastic approach using a continuous-time Markov chain confirms the major role of seasonal rainfall. Numerical simulations reveal a peak in animal and human infections around the 9th month, correlating with periods of heavy rainfall. This model provides a relevant tool for surveillance and prevention within a "One Health" approach in Chad.
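
For readers unfamiliar with the SEIR structure named in this abstract, the Python sketch below integrates a deliberately simplified single-population SEIR model with forward Euler steps. It is not the authors' model: it omits vectors, ruminants, vertical transmission, demography, and rainfall forcing, and every parameter value is an arbitrary illustration. For this simplified model only, the basic reproduction number reduces to beta / gamma.

```python
def simulate_seir(beta, sigma, gamma, s0, e0, i0, rec0, days, dt=0.1):
    """Forward-Euler integration of a closed-population SEIR model.

    beta: transmission rate, sigma: rate of leaving the exposed class,
    gamma: recovery rate (all per day). Returns final compartment sizes
    and the peak size and timing of the infectious compartment.
    """
    s, e, i, r = float(s0), float(e0), float(i0), float(rec0)
    n = s + e + i + r
    peak_i, peak_day = i, 0.0
    steps = int(round(days / dt))
    for step in range(1, steps + 1):
        new_exposed = beta * s * i / n * dt   # S -> E
        new_infectious = sigma * e * dt       # E -> I
        new_recovered = gamma * i * dt        # I -> R
        s -= new_exposed
        e += new_exposed - new_infectious
        i += new_infectious - new_recovered
        r += new_recovered
        if i > peak_i:
            peak_i, peak_day = i, step * dt
    return {"S": s, "E": e, "I": i, "R": r,
            "peak_I": peak_i, "peak_day": peak_day}


# Illustrative parameter values only (assumptions, not taken from the preprint).
beta, sigma, gamma = 0.5, 0.2, 0.25
basic_reproduction_number = beta / gamma  # valid for this simplified model only

result = simulate_seir(beta, sigma, gamma,
                       s0=990, e0=0, i0=10, rec0=0, days=200)
```

Because every flow leaves one compartment and enters another, the total population stays constant, which gives a quick sanity check on the integration. A value of beta / gamma above 1 lets the infectious compartment grow before it declines.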